sequence in the current record. The “HI” tag shows the index of the query hit. The “AS” tag
shows the alignment score defined by the aligner. The “NM” tag shows the edit distance,
which is defined as the minimal number of single-nucleotide edits (substitutions, insertions,
and deletions) needed to transform the read sequence into the aligned segment of
the reference sequence.
For more details about the SAM format, see the specification of the Sequence Alignment/Map
file format, which is available at “https://samtools.github.io/hts-specs/”.
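As a minimal sketch of how these optional tags can be read in practice, the following shell function uses awk to print the value of a named tag for each alignment record. The function name sam_tag and the sample record are hypothetical and made up for illustration. The sketch assumes the usual SAM layout, in which optional fields start at column 12 and have the form TAG:TYPE:VALUE.

```shell
# Print the value of one optional tag (e.g. NM or AS) for each alignment record.
# Assumes tab-separated SAM text on stdin; optional fields begin at column 12.
sam_tag() {
    tag="$1"
    awk -F'\t' -v tag="$tag" '
        /^@/ { next }                          # skip header lines
        {
            for (i = 12; i <= NF; i++) {
                split($i, f, ":")              # f[1]=TAG, f[2]=TYPE
                if (f[1] == tag) {
                    # print everything after "TAG:TYPE:" (the value may contain ":")
                    print substr($i, length(f[1]) + length(f[2]) + 3)
                    break
                }
            }
        }'
}

# Hypothetical SAM record, for illustration only (not real data)
printf 'read1\t99\tchr1\t100\t60\t10M\t=\t300\t210\tACGTACGTAC\tIIIIIIIIII\tNM:i:1\tAS:i:45\tHI:i:1\n' | sam_tag NM
```

With a real alignment file, the same function could be fed from a command such as "samtools view file.bam | sam_tag NM", assuming samtools is installed.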
2.3.2 Read Aligners
Several aligners are available for mapping reads to a reference genome. Here we will
discuss only BWA [9], Bowtie [17], and STAR [8] as examples; other aligners are used in a
similar way.
By the time we reach the read mapping step, the raw reads in the FASTQ files will already
have been cleaned as discussed in Chapter 1. In the following sections, we will show you
how to map reads contained in FASTQ files to a reference genome. For this purpose, we
will download the FASTQ files for run SRR769545 from the NCBI SRA database. To avoid
repeating the quality control steps, we will assume that the FASTQ files have been cleaned
following the steps in Chapter 1 and are ready for mapping. Our example raw data
are paired-end reads from the 1000 Genomes whole-exome sequencing of an individual
from the Great Britain population. We can download the two FASTQ files (forward and
reverse) using the SRA Toolkit. The following commands create the directory “data” and
download the two FASTQ files into it:
mkdir data
cd data
fasterq-dump --verbose SRR769545
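Because the download is large, it can be convenient to skip it when the files are already present. The following is a minimal sketch under that assumption. The function name fetch_run is hypothetical, and the --outdir, --threads, and --progress options are used as we understand them from the SRA Toolkit; check "fasterq-dump --help" for the installed version.

```shell
# Download one SRA run into a directory unless its forward FASTQ file already exists.
# fetch_run is a hypothetical helper name used only for this illustration.
fetch_run() {
    run="$1"
    outdir="${2:-data}"
    mkdir -p "$outdir"
    if [ -s "$outdir/${run}_1.fastq" ] || [ -s "$outdir/${run}_1.fastq.gz" ]; then
        echo "$run already present in $outdir, skipping download"
        return 0
    fi
    # Thread count of 4 is an arbitrary choice for the sketch
    fasterq-dump --outdir "$outdir" --threads 4 --progress "$run"
}

# Example call (commented out because it downloads about 22G):
# fetch_run SRR769545 data
```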
Each file is about 11G, so the two files take around 22G of storage space. To save storage
space, we can compress these files using the gzip utility, which reduces each file to about
2.6G. Most aligners accept gzipped FASTQ files.
gzip SRR769545_1.fastq
gzip SRR769545_2.fastq
The “.gz” extension is appended to the name of each file to indicate that the file was
compressed with gzip.
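Before mapping, it may be worth confirming that the forward and reverse files contain the same number of reads. The following is a minimal sketch, assuming each FASTQ record occupies exactly four lines; the function names count_reads and check_pairs are hypothetical.

```shell
# Count reads in a gzipped FASTQ file, assuming four lines per read.
count_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Report both counts and return non-zero if the two files differ.
check_pairs() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "forward: $n1 reads, reverse: $n2 reads"
    [ "$n1" = "$n2" ]
}

# Example call with the files from the text (commented out; needs the downloaded data):
# check_pairs SRR769545_1.fastq.gz SRR769545_2.fastq.gz
```

Reading through "gzip -cd" leaves the compressed files untouched, so the check can be run at any point after compression.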
We can also run FastQC and display the QC reports as follows:
fastqc SRR769545_1.fastq.gz SRR769545_2.fastq.gz
firefox SRR769545_1_fastqc.html SRR769545_2_fastqc.html
The per-base quality reports for the two FASTQ files are shown in Figure 2.17. The reads
in both files are of good quality.
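The reports can also be checked from the command line. The following is a minimal sketch that lists any FastQC module whose status is not PASS. It assumes that each FastQC zip archive contains a tab-separated summary.txt file with the status in the first column and the module name in the second; the function name flag_modules is hypothetical.

```shell
# List FastQC modules whose status is not PASS.
# Reads the contents of a FastQC summary.txt file on stdin.
flag_modules() {
    awk -F'\t' '$1 != "PASS" { print $1 ": " $2 }'
}

# Example call (commented out; assumes the archive layout described above):
# unzip -p SRR769545_1_fastqc.zip SRR769545_1_fastqc/summary.txt | flag_modules
```

An empty result means that every module passed; any WARN or FAIL lines point to the sections of the HTML report worth inspecting first.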